System and method for using XML to normalize documents

ABSTRACT

A system, method, and processor readable medium for normalizing documents using extensible markup language (XML). The system may determine a type of object repository storing at least one object. The object may include metadata. The system may then identify the object stored in the object repository. At least one portion of the one object may be extracted from the repository, wherein the portion is extracted in extensible markup language (XML) format. Preferably, some of the metadata is preserved. The metadata preserved may include at least one of author, title, subject, date created, date modified, list of modifiers, and link list information. The portion may then be transmitted to a processor. The processor may perform one or more processes on the portion. A mapping may be performed that maps at least one field in the object with a field designation identifier. The processor may include at least one of a full-text engine, a metrics engine, and a taxonomy engine.

RELATED APPLICATIONS

[0001] This application claims priority from U.S. Provisional Patent Application Serial No. ______, filed Jan. 14, 2002, titled, “Knowledge Server,” Attorney Docket No. 23452-500-301, which is hereby incorporated by reference. This application is related to co-pending patent application titled “System and Method for Processing Data in a Distributed Architecture,” Attorney Docket No. 23452-504, filed concurrently, which is hereby incorporated by reference.

FIELD OF THE INVENTION

[0002] The invention relates to a system and method for normalizing documents using extensible markup language (XML).

BACKGROUND OF THE INVENTION

[0003] Knowledge management systems are known. Knowledge management systems may be used to collect information from information systems within an organization. The knowledge management system may perform one or more processing actions on the information, such as, for example, categorization, full-text indexing, and metrics extraction. Each of these processes, however, is typically performed synchronously. Therefore, the information may only be available in each information system at varying times. A particular information system may be updated with other information and the information system may not be accessible for an extended period of time. This results in higher development costs and extended customer disruptions.

[0004] Current knowledge management systems typically use a single process for performing one or more processes on information collected from the information systems. Therefore, if an information system fails, information may be lost. This is a drawback.

[0005] These and other drawbacks exist.

SUMMARY OF THE INVENTION

[0006] An object of the invention is to overcome these and other drawbacks of existing systems.

[0007] Another object of the invention is to provide a system and method for normalizing documents using extensible markup language (XML).

[0008] Another object of the invention is to provide a system and method for normalizing documents using XML that enables metadata in a document to be preserved.

[0009] Another object of the invention is to provide a system and method for normalizing documents using XML that maps fields within the document with at least one field designation identifier.

[0010] Another object of the invention is to provide a system and method for normalizing documents using XML that provides full-text indexing, categorizing, and metrics extraction.

[0011] Another object of the invention is to provide a system and method for processing data that performs one or more processes on information in an asynchronous manner.

[0012] Another object of the invention is to provide a system and method for data processing that processes information in a parallel manner.

[0013] Another object of the invention is to provide a system and method for data processing that enables recovery of information in the event of a system failure.

[0014] These and other objects of the invention are achieved according to various embodiments of the invention. A system, method, and processor readable medium for normalizing documents using extensible markup language (XML) are provided. The system may determine a type of object repository storing at least one object. The object may include metadata. The system may then identify the object stored in the object repository. At least one portion of the one object may be extracted from the repository, wherein the portion is extracted in extensible markup language (XML) format. Preferably, some of the metadata is preserved. The metadata preserved may include at least one of author, title, subject, date created, date modified, list of modifiers, and link list information. The portion may then be transmitted to a processor. The processor may perform one or more processes on the portion. A mapping may be performed that maps at least one field in the object with a field designation identifier. The processor may include at least one of a full-text engine, a metrics engine, and a taxonomy engine.

[0015] These and other objects, features and advantages of the invention will be readily apparent to those having ordinary skill in the pertinent art from the detailed descriptions of the embodiments with reference to the appropriate figures below.

BRIEF DESCRIPTIONS OF THE DRAWINGS

[0016] FIG. 1 is a schematic block diagram of a system for data processing according to one embodiment of the invention.

[0017] FIG. 2 is a schematic block diagram of the method for data processing according to one embodiment of the invention.

[0018] FIG. 3 is a schematic block diagram of data processing according to one embodiment of the invention.

[0019] FIG. 4 is a schematic block diagram of a method for normalizing documents in XML format according to one embodiment of the invention.

[0020] FIG. 5 is a schematic block diagram of a system for normalizing documents using XML according to one embodiment of the invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

[0021] A system, method, and processor readable medium for processing data in a knowledge management system is disclosed. The system may asynchronously process data such that multiple processes are performed simultaneously. The system may perform categorization, full-text indexing, and metrics extraction, or other process simultaneously, such that a repository is maintained with current information and in the event of a failure, the likelihood of recovering information is greater.

[0022] FIG. 1 illustrates a system 100 for asynchronously processing data according to one embodiment of the invention. System 100 may include repositories 102a-102n. Repositories 102a-102n may be in communication with spider component 104. A spider component may be, for example, a Domino add-in process that invokes threads to explore different repositories. Different spider types may be designed to extract content from various content repository types. Once a spider process is started, spider component 104 may start any number of additional threads to explore different repositories, including, for example, Lotus Notes™, Lotus QuickPlace™, Domino.Doc, electronic mail (Lotus Domino™), Web and file system. This enables one server to use a Lotus Notes™ spider and a second server to use a Lotus Notes™ and file system spider.

[0023] Spider component 104 may be in communication with a scheduler 106, content map component 108, taxonomy engine 110, full-text engine 112, and metrics engine 114. Spider component 104 may also communicate with content map 108, taxonomy engine 110, full-text engine 112, and metrics engine 114 to update and make available information stored in repositories 102a-102n in a variety of formats. Spider component 104 may receive work requests, on a scheduled basis, from scheduler 106 that describe which repositories to process on a work queue. The schedule may be hourly, daily, weekly, or other basis. The work requests may also be dispatched on a random basis. Scheduler 106 may communicate with a repository schedule 116 for determining when a particular process is scheduled. The repository schedule may detail a type and frequency of spidering for one or more repositories. For example, the repository schedule may identify that repository 102a is full-text indexed on a daily basis and repository 102b has a categorization and metrics extraction performed hourly.
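
The interaction between a scheduler and a repository schedule described above might be sketched as follows. This is a minimal illustration only: the `ScheduleEntry` structure, the `dispatch_due` function, and the in-memory queue are hypothetical names chosen for the example and do not appear in the specification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from queue import Queue


@dataclass
class ScheduleEntry:
    """Hypothetical repository schedule row: what to run, where, and how often."""
    repository: str
    process: str          # e.g. "full-text", "categorization", "metrics"
    interval: timedelta
    last_run: datetime


def dispatch_due(schedule, work_queue, now):
    """Place a work request on the queue for every schedule entry that is due."""
    dispatched = 0
    for entry in schedule:
        if now - entry.last_run >= entry.interval:
            work_queue.put({"repository": entry.repository, "process": entry.process})
            entry.last_run = now
            dispatched += 1
    return dispatched


start = datetime(2002, 1, 14, 0, 0)
schedule = [
    ScheduleEntry("102a", "full-text", timedelta(days=1), start),
    ScheduleEntry("102b", "categorization", timedelta(hours=1), start),
    ScheduleEntry("102b", "metrics", timedelta(hours=1), start),
]
work_queue = Queue()
# Two hours later only the hourly entries for repository 102b are due.
count = dispatch_due(schedule, work_queue, start + timedelta(hours=2))
```

In this sketch the daily full-text entry for repository 102a stays off the queue until a full day has elapsed, mirroring the example schedule in the paragraph above.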

[0024] The processes may occur asynchronously. For example, content map 108 may process information in repositories 102a-102n such that a map of all content stored in repositories 102a-102n is provided. A replica of the content map may be stored as content replica 118. Taxonomy engine 110 may be used to determine categories of information stored in repositories 102a-102n. Full-text engine 112 may be used to provide a full-text index of information stored in repositories 102a-102n. Full-text engine 112 may communicate with full-text replica 120 that may be used as a backup for information provided by full-text engine 112. Metrics engine 114 may be used to extract metrics information from information stored in repositories 102a-102n. Taxonomy engine 110, full-text engine 112, and metrics engine 114 may be in communication with content map 108. Therefore, content map 108 may include a map of all information stored in repositories 102a-102n, categories of information stored in repositories 102a-102n, a full-text index of information stored in repositories 102a-102n, and metrics information for information stored in repositories 102a-102n.

[0025] Content map 108, taxonomy engine 110, full-text engine 112, and metrics engine 114 preferably operate in an asynchronous manner. This enables each of content map 108, taxonomy engine 110, full-text engine 112, and metrics engine 114 to operate independently. Content map 108, taxonomy engine 110, full-text engine 112, and metrics engine 114 preferably do not rely on each other to perform a particular process. This enables information to be available to users because of a reduction in downtime. Additionally, each of content map 108, taxonomy engine 110, full-text engine 112, and metrics engine 114 may be decoupled and replaced individually, thus reducing development costs.
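
The independence of the engines can be illustrated with a small sketch in which each engine runs in its own thread and a failure in one does not prevent the others from completing. The engine functions below are toy stand-ins written for this example (the simulated taxonomy failure in particular is an assumption made to show isolation); the specification does not prescribe a threading model.

```python
from concurrent.futures import ThreadPoolExecutor


def full_text_engine(doc):
    # Toy index: the sorted set of lowercase words in the body.
    return sorted(set(doc["body"].lower().split()))


def metrics_engine(doc):
    # Toy metric: word count only.
    return {"word_count": len(doc["body"].split())}


def taxonomy_engine(doc):
    # Simulated failure to show that the other engines are unaffected.
    raise RuntimeError("taxonomy engine unavailable")


def run_engines(doc, engines):
    """Submit the document to every engine; collect results and errors separately."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fn, doc) for name, fn in engines.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                errors[name] = str(exc)
    return results, errors


doc = {"id": "doc-1", "body": "XML normalizes documents"}
results, errors = run_engines(doc, {
    "full_text": full_text_engine,
    "metrics": metrics_engine,
    "taxonomy": taxonomy_engine,
})
```

Because no engine consumes another engine's output, any one of them could be swapped for a different implementation without touching the rest.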

[0026] A knowledge management system may be made more reliable by making the failure of a subsystem more recoverable. Scheduler 106 may include a protocol that handles a failure or shutdown of spider component 104. The protocol may be used to enable spider component 104 to transmit a context on shutdown to scheduler 106. The context may then be transmitted back to spider component 104 when spider component 104 resumes functioning. This enables spider component 104 to resume processing work requests from an intermediate state. Any information regarding a failure or shutdown may be transmitted via a completion work queue. The work queues may include content map 108, taxonomy engine 110, full-text engine 112, and metrics engine 114. The system may also be more fault tolerant by separating various functions into various processes that may be run independently.
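
One way to picture the shutdown protocol is a spider that hands a small context back to the scheduler when it stops and receives it again on restart. The sketch below is an assumption-laden illustration: the `Scheduler` and `Spider` classes, the `next_index` context field, and the in-memory storage are all hypothetical and are not taken from the specification.

```python
class Scheduler:
    """Holds the last context reported by each spider (hypothetical in-memory store)."""

    def __init__(self):
        self._contexts = {}

    def save_context(self, spider_id, context):
        self._contexts[spider_id] = dict(context)

    def load_context(self, spider_id):
        # A spider with no saved context starts from the beginning.
        return dict(self._contexts.get(spider_id, {"next_index": 0}))


class Spider:
    def __init__(self, spider_id, scheduler, documents):
        self.spider_id = spider_id
        self.scheduler = scheduler
        self.documents = documents
        self.processed = []

    def run(self, stop_after=None):
        """Process documents from the saved position; optionally stop early."""
        context = self.scheduler.load_context(self.spider_id)
        index = context["next_index"]
        while index < len(self.documents):
            if stop_after is not None and len(self.processed) >= stop_after:
                break  # simulated shutdown
            self.processed.append(self.documents[index])
            index += 1
        # On shutdown or completion, hand the context back to the scheduler.
        self.scheduler.save_context(self.spider_id, {"next_index": index})


scheduler = Scheduler()
docs = ["a", "b", "c", "d"]
first = Spider("spider-104", scheduler, docs)
first.run(stop_after=2)   # shuts down after two documents
second = Spider("spider-104", scheduler, docs)
second.run()              # resumes at the third document
```

The restarted spider picks up from the intermediate state rather than reprocessing the whole repository.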

[0027] FIG. 2 illustrates a method for processing data in a knowledge management system according to one embodiment of the invention. Information content may be gathered for data processing by a spider component, step 202. The spider component may also register the information content gathered with a content map, step 204. The content map may assign the information content gathered a unique identifier, step 206. The spider component may transmit work requests to, for example, a taxonomy engine, full-text engine or metrics engine, regarding the information content gathered, step 208. The one or more engines may refer to the information content gathered using the unique identifier. The unique identifier may be a part of an extensible markup language (XML) meta-document representation (described in further detail below) that may be transmitted to system users.
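
Steps 204 through 208 can be sketched as a registry that assigns each piece of content a stable identifier which later work requests carry. The `ContentMap` class, its method names, the example location, and the choice of a random UUID as the identifier are assumptions made for illustration; the specification does not state how identifiers are generated.

```python
import uuid


class ContentMap:
    """Hypothetical content map: one identifier per (repository, path) location."""

    def __init__(self):
        self._by_id = {}
        self._by_location = {}

    def register(self, repository, path):
        """Return the existing identifier for a location, or assign a new one."""
        key = (repository, path)
        if key not in self._by_location:
            doc_id = uuid.uuid4().hex
            self._by_location[key] = doc_id
            self._by_id[doc_id] = key
        return self._by_location[key]

    def locate(self, doc_id):
        """Resolve an identifier back to its repository location."""
        return self._by_id[doc_id]


content_map = ContentMap()
doc_id = content_map.register("102a", "reports/q1.nsf")
same_id = content_map.register("102a", "reports/q1.nsf")
other_id = content_map.register("102b", "reports/q1.nsf")

# Engines receive work requests that refer to content by identifier only.
work_request = {"engine": "taxonomy", "document_id": doc_id}
```

Registering the same location twice returns the same identifier, so repeated spidering does not create duplicate entries in this sketch.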

[0028] The work requests may then be processed, step 210. The work request may be, for example, processing the repository from which the information content is gathered and converting documents stored in the repository into a standard meta-document representation in XML format. The process of converting the document into a standard meta-document is described in further detail with reference to FIG. 4 below.

[0029] The spider component may transmit control messages to system users advising of a start and finish of a work request, step 212. The control messages preferably do not contain any XML content. The meta-document representations may then be transmitted to a designated module for predetermined processing, step 214. The modules may be, for example, a content map, taxonomy engine, full-text indexing engine, and a metrics engine. The modules may then process the meta-documents, step 216. The processing of the meta-documents may vary depending on the module performing the processing. For example, a content map may generate a map of the information content stored in a repository. A taxonomy engine may assign categories to the information content stored in a repository. A full-text indexing engine may generate a full-text index for information content stored in a repository. A metrics engine may extract metrics information from the information content stored in a repository and store only the metrics information. The processes may be performed asynchronously such that each module operates independently and may perform processes in a parallel manner. In this manner, a greater amount of information content in a repository is made available to users at least because the knowledge management system has less downtime for processing information content stored in a repository.
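
Steps 212 and 214 might be pictured as a fan-out in which every module queue receives a start control message, then the meta-documents, then a finish control message. The message layout, the `publish` function, and the queue names below are hypothetical and chosen only for this sketch; note that the control messages carry no XML content, in line with the paragraph above.

```python
from queue import Queue

MODULES = ("content_map", "taxonomy", "full_text", "metrics")


def publish(work_request_id, meta_documents, queues):
    """Bracket the meta-documents of one work request with control messages."""
    for q in queues.values():
        q.put({"type": "control", "event": "start", "request": work_request_id})
    for xml_text in meta_documents:
        for q in queues.values():
            q.put({"type": "document", "request": work_request_id, "xml": xml_text})
    for q in queues.values():
        q.put({"type": "control", "event": "finish", "request": work_request_id})


queues = {name: Queue() for name in MODULES}
publish("req-1", ['<metadocument id="d1"/>', '<metadocument id="d2"/>'], queues)
```

Each module then drains its own queue at its own pace, which is what allows the processing to proceed in parallel.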

[0030] After the meta-documents are processed, the meta-documents may be analyzed, step 218. The analysis may be to determine a type of information content stored in a repository. The meta-documents may also be indexed, step 220.

[0031] Progress statistics may also be generated for each of the processes, step 222. The progress statistics may be presented in one or more reports and generated by a spider component and a work queue. The progress statistics may be transmitted to a scheduler component via a completion work queue, step 224. The scheduler component may read the progress statistics and update any corresponding statistics in a repository schedule. The scheduler component may also update a log database with any warnings or errors generated by a work queue. Each module may then be enabled with shared access to a central data structure representing the metrics history and taxonomy or other information via a CORBA service, step 226.

[0032] FIG. 3 illustrates a system for processing data in a knowledge management system according to one embodiment of the invention. The system may include an information content gathering module 302, information content registering module 304, document identifier assigning module 306, work request transmitting module 308, work request processing module 310, control message transmitting module 312, information content transmitting module 314, information content processing module 316, information content analyzing module 318, information content indexing module 320, progress statistics generating module 322, progress statistics transmitting module 324, and access sharing module 326.

[0033] Information content gathering module 302 may be used to gather information content from one or more repositories based on a repository schedule. The repository schedule may identify a type and frequency with which to gather the information content. Information content registering module 304 may be used to register the information content gathered with, for example, a content map. Document identifier assigning module 306 may then assign the information content gathered one or more unique document identifiers that may be used by, for example, other modules for retrieving and identifying the information content. A work request regarding the information content gathered may be transmitted to a persistent work queue using work request transmitting module 308. The work requests may then be processed for the repository from which the information content was gathered using work request processing module 310. Work request processing module 310 may include converting documents stored in a repository into a standard meta-document representation in extensible markup language (XML) first. Control message transmitting module 312 may be used to transmit control messages to one or more users that provide a status regarding work requests. The control messages may identify a start and/or finish of a work request or other information.

[0034] The meta-documents may then be transmitted to a processing work queue for further processing using information content transmitting module 314. The processing may be, for example, full-text indexing, categorization, metrics extraction, or other process. The documents may be processed using information content processing module 316.

[0035] After processing the meta-documents, the meta-documents may be analyzed using information content analyzing module 318. The analysis may include determining a type of information stored in the repository. The meta-documents may also be indexed using information content indexing module 320.

[0036] Progress statistics regarding the processes performed on the information content gathered may be generated using progress statistics generating module 322. The progress statistics may be generated in one or more reports. The progress statistics may be transmitted to other components in a knowledge management system using progress statistics transmitting module 324. All components within the knowledge management system may be provided with shared access to a central data structure representing the metrics history and taxonomy of the information content via a CORBA service using access sharing module 326.

[0037] FIG. 4 illustrates a method for processing a work request according to one embodiment of the invention. A work request may be processed by determining a repository type from which information content is gathered, step 402. The document may then be identified, step 404. The document may then be extracted from the repository in XML format, step 406. The document extracted may be a meta-document. The meta-document may include metrics information from the document. For example, the document may include author, title, subject, date created, date modified, list of modifiers, links list information, and other information. The meta-document may be transmitted to a work queue for further processing, step 408. The meta-document may then be processed according to a pre-determined process for the work queue, step 410. The work queue may, for example, categorize, full-text index, or perform other process on the meta-document. Fields within the meta-document may be mapped with a field identifier, step 412. For example, an author of a document may be mapped with an author field, a creation date may be mapped with a date created field, a title may be mapped with a title field, and other metrics information may be mapped with a corresponding field designation identifier.
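
The extraction in step 406 can be illustrated by building an XML meta-document that preserves the metadata listed above. This is a sketch under stated assumptions: the element names (`metadocument`, `modifiers`, `links`, and so on), the sample values, and the `to_meta_document` function are invented for the example, since the specification does not define a schema. It uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Element names are illustrative; the specification does not define a schema.
SCALAR_FIELDS = ("author", "title", "subject", "date_created", "date_modified")


def to_meta_document(doc_id, source):
    """Build an XML meta-document string that preserves the listed metadata."""
    root = ET.Element("metadocument", {"id": doc_id})
    for name in SCALAR_FIELDS:
        if name in source:
            ET.SubElement(root, name).text = source[name]
    modifiers = ET.SubElement(root, "modifiers")
    for person in source.get("modifiers", []):
        ET.SubElement(modifiers, "modifier").text = person
    links = ET.SubElement(root, "links")
    for target in source.get("links", []):
        ET.SubElement(links, "link").text = target
    ET.SubElement(root, "body").text = source.get("body", "")
    return ET.tostring(root, encoding="unicode")


source = {
    "author": "A. Author",
    "title": "Quarterly plan",
    "subject": "Planning",
    "date_created": "2002-01-14",
    "modifiers": ["B. Editor", "C. Reviewer"],
    "links": ["doc-7"],
    "body": "Plan text & notes",
}
xml_text = to_meta_document("doc-1", source)
parsed = ET.fromstring(xml_text)
```

Fields absent from the source (here `date_modified`) are simply omitted, and special characters such as `&` in the body are escaped by the serializer, so downstream work queues receive well-formed XML regardless of the originating repository.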

[0038] FIG. 5 illustrates a system for processing a work request according to one embodiment of the invention. The system may include repository type determining module 502, document identifying module 504, document extracting module 506, document transmitting module 508, document processing module 510, and field mapping module 512.

[0039] Repository type determining module 502 may determine a repository type from which a document may be gathered. Document identifying module 504 may identify the document to be collected from the repository. Document extracting module 506 may extract the document from the repository. Document extracting module 506, however, may convert the document stored in the repository into a standard meta-document representation in an XML format. The meta-document may include meta-data regarding the document. For example, the meta-document may include author, title, subject, date created, date modified, list of modifiers, and links list information.

[0040] The meta-document may then be transmitted to a work queue for processing using document transmitting module 508. The meta-document may then be processed according to a process designated for a particular work queue using document processing module 510. The processes may include, for example, categorization, full-text indexing, metrics extraction or other process. Field mapping module 512 may be used to map fields in the meta-document with a field designation identifier. For example, author, title, and subject information may be mapped with an author field, title field, and subject field, respectively. Other fields may also be mapped.
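
The field mapping performed by a module such as field mapping module 512 might look like a per-repository-type lookup table that translates native field names into shared field designation identifiers. The repository type keys and the native field names in this sketch (`From`, `Subject`, `owner`, and so on) are hypothetical examples, not names taken from the specification.

```python
# Hypothetical native field names per repository type, mapped to shared
# field designation identifiers.
FIELD_MAPS = {
    "notes": {"From": "author", "Subject": "title", "Categories": "subject"},
    "file_system": {"owner": "author", "filename": "title"},
}


def map_fields(repository_type, native_fields):
    """Split native fields into mapped identifiers and unmapped leftovers."""
    mapping = FIELD_MAPS[repository_type]
    mapped, unmapped = {}, {}
    for name, value in native_fields.items():
        if name in mapping:
            mapped[mapping[name]] = value
        else:
            unmapped[name] = value
    return mapped, unmapped


mapped, unmapped = map_fields(
    "notes", {"From": "J. Smith", "Subject": "Q1 plan", "Form": "Memo"}
)
```

Keeping the table keyed by repository type means documents from different sources end up with the same identifiers, which is what lets a single engine process all of them uniformly.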

[0041] Other embodiments and uses of the invention will be apparent to those skilled in the art in consideration of the specification and practice of the invention disclosed herein. The specification and examples should be considered exemplary only. For example, although the invention has been described in terms of a document, a document may be any or current document that may be categorized; for example, electronic mail messages, graphic files, or other type of electronic document. Additionally, although the invention has been described in terms of multiple modules, fewer or a greater number of modules may be used and modules may not be provided in the same location. The scope of the invention is only limited by the claims appended hereto.

What is claimed is:
 1. A method for using extensible markup language to normalize documents, the method comprising the steps of: determining a type of object repository storing at least one object, the object comprising metadata; identifying the at least one object stored in the at least one object repository; extracting at least one portion of the at least one object, wherein the at least one portion is extracted in extensible markup language (XML) format; and transmitting the at least one portion to a processor; and processing the at least one portion.
 2. The method of claim 1, wherein some of the metadata is preserved.
 3. The method of claim 2, wherein the metadata that is preserved includes at least one of author, title, subject, date created, date modified, list of modifiers, and link list information.
 4. The method of claim 1, further comprising the step of: mapping at least one field in the at least one object with a field designation identifier.
 5. The method of claim 1, wherein the processor comprises at least one of a full-text engine, a metrics engine, and a taxonomy engine.
 6. A system for using extensible markup language to normalize documents, the system comprising: a determining module that determines a type of object repository storing at least one object, the object comprising metadata; an identifying module that identifies the at least one object stored in the at least one object repository; an extracting module that extracts at least one portion of the at least one object, wherein the at least one portion is extracted in extensible markup language (XML) format; and a transmitting module that transmits the at least one portion to a processor; and a processing module that processes the at least one portion.
 7. The system of claim 6, wherein some of the metadata is preserved.
 8. The system of claim 7, wherein the metadata that is preserved includes at least one of author, title, subject, date created, date modified, list of modifiers, and link list information.
 9. The system of claim 6, further comprising: a mapping module that maps at least one field in the at least one object with a field designation identifier.
 10. The system of claim 6, wherein the processing module comprises at least one of a full-text engine, a metrics engine, and a taxonomy engine.
 11. A system for using extensible markup language to normalize documents, the system comprising: determining means for determining a type of object repository storing at least one object, the object comprising metadata; identifying means for identifying the at least one object stored in the at least one object repository; extracting means for extracting at least one portion of the at least one object, wherein the at least one portion is extracted in extensible markup language (XML) format; and transmitting means for transmitting the at least one portion to a processor; and processing means for processing the at least one portion.
 12. The system of claim 11, wherein some of the metadata is preserved.
 13. The system of claim 12, wherein the metadata that is preserved includes at least one of author, title, subject, date created, date modified, list of modifiers, and link list information.
 14. The system of claim 11, further comprising: mapping means for mapping at least one field in the at least one object with a field designation identifier.
 15. The system of claim 11, wherein the processing means comprises at least one of a means for full-text indexing the at least one object, means for extracting metrics information from the at least one object, and means for categorizing the at least one object.
 16. A processor readable medium comprising processor readable code for causing a processor to use extensible markup language to normalize documents, the medium comprising: determining code that causes a processor to determine a type of object repository storing at least one object, the object comprising metadata; identifying code that causes a processor to identify the at least one object stored in the at least one object repository; extracting code that causes a processor to extract at least one portion of the at least one object, wherein the at least one portion is extracted in extensible markup language (XML) format; transmitting code that causes a processor to transmit the at least one portion to a processor; and processing code that causes a processor to process the at least one portion.
 17. The medium of claim 16, wherein some of the metadata is preserved.
 18. The medium of claim 17, wherein the metadata that is preserved includes at least one of author, title, subject, date created, date modified, list of modifiers, and link list information.
 19. The medium of claim 16, further comprising: mapping code that causes a processor to map at least one field in the at least one object with a field designation identifier.
 20. The medium of claim 16, wherein the processing code comprises at least one of a full-text engine, a metrics engine, and a taxonomy engine.